Missing value treatment¶

# Import modules.

import pandas as pd
import numpy as np

# Build the DataFrame. Row 1 is missing in every column; row 2 is missing only the two test scores.
data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 
        'age': [42, np.nan, 36, 24, 73], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

Drop every row that contains at least one missing value¶

df_no_missing = df.dropna()
df_no_missing

Drop only the rows in which every value is NA¶

df_cleaned = df.dropna(how='all')
df_cleaned
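
The difference between the two calls above is easiest to see side by side. The following is a minimal sketch on a small hypothetical frame (`demo` and its column names are made up for illustration and are not part of the tutorial's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 2 is only partly missing.
demo = pd.DataFrame({
    "name": ["Jason", np.nan, "Tina"],
    "score": [4.0, np.nan, np.nan],
})

# how="any" (the default) drops a row as soon as one value is missing.
any_dropped = demo.dropna(how="any")

# how="all" drops a row only when every value in it is missing.
all_dropped = demo.dropna(how="all")

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
```

With `how="any"` only the complete row survives, while `how="all"` also keeps the partly filled row.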

Let's create a new column called 'location' that contains only missing values¶

df['location'] = np.nan
df

Drop the columns in which every value is missing¶

df.dropna(axis=1, how='all')
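
Note that `dropna` returns a new object and leaves the original untouched unless the result is assigned. A minimal sketch, using a hypothetical two-column frame (`demo` is an illustrative name, not part of the tutorial's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "location" is entirely missing, "age" is partly missing.
demo = pd.DataFrame({
    "age": [42.0, np.nan],
    "location": [np.nan, np.nan],
})

# axis=1 works on columns; how="all" removes only columns with no values at all.
trimmed = demo.dropna(axis=1, how="all")

print(list(trimmed.columns))  # columns of the returned copy
print(list(demo.columns))     # the original frame still has both columns
```

The partly missing `age` column is kept because it still holds one value.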

Drop the rows that contain fewer than five non-missing values¶

df.dropna(thresh=5)
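
The `thresh` argument counts non-missing values, not missing ones: a row is kept when it has at least `thresh` values that are not NA. A minimal sketch on a hypothetical three-column frame (`demo` and the threshold of 2 are illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 3, 1 and 1 non-missing values per row.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan],
})

# Count the non-missing values in each row.
non_missing = demo.notna().sum(axis=1)

# Keep rows with at least two non-missing values.
kept = demo.dropna(thresh=2)

# The same selection written as an explicit boolean mask.
kept_by_mask = demo[non_missing >= 2]

print(non_missing.tolist())
print(kept.index.tolist())
```

Writing the filter as a mask is more verbose, but it makes the counting rule explicit.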

Fill in missing data with zeros¶

df.fillna(0)

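Fill the missing values with the mean¶

# Fill missing preTestScore values with the column mean.
# Assign the result back instead of calling inplace=True on a column selection,
# which may act on a temporary copy and is deprecated in recent pandas versions.
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())
df

# Fill missing postTestScore values with the mean postTestScore of the same sex.
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))
df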
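
The group-wise fill works because `transform("mean")` returns one value per original row (the mean of that row's group), so it lines up with the column being filled. A minimal sketch on a hypothetical frame (`demo` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one known and one missing score per group.
demo = pd.DataFrame({
    "sex": ["m", "m", "f", "f"],
    "score": [10.0, np.nan, 30.0, np.nan],
})

# One group mean per row, aligned to demo's index; NaN scores are skipped.
group_mean = demo.groupby("sex")["score"].transform("mean")

# Fill each missing score with the mean of its own group.
demo["score"] = demo["score"].fillna(group_mean)

print(group_mean.tolist())
print(demo["score"].tolist())
```

A row whose grouping key is itself missing belongs to no group, so under the default `groupby` settings it should remain unfilled. In the tutorial's data that applies to row 1, where `sex` is NaN.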

# Select the rows of df where age is not NaN and sex is not NaN
df[df['age'].notnull() & df['sex'].notnull()]

How replace() works¶

# Let's create a new Series to explore replace(). Here -99 is a placeholder for a missing value.
data = pd.Series([1,2,-99,4,5,-99,7,8,-99])
data

0     1
1     2
2   -99
3     4
4     5
5   -99
6     7
7     8
8   -99
dtype: int64

# Replace the placeholder -99 with NaN
data.replace(-99, np.nan)

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
5    NaN
6    7.0
7    8.0
8    NaN
dtype: float64

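The -99 placeholders no longer appear because each one has been replaced by NaN. The dtype has also changed from int64 to float64, since NaN is a floating-point value. Note that replace() returns a new Series; data itself is unchanged unless the result is assigned back.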
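
`replace` also accepts a list of values or a dictionary, which helps when a data set uses more than one placeholder. The sketch below is an extension of the example above, not part of the original walkthrough; the second placeholder `-1000` and the variable names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two different placeholder codes.
s = pd.Series([1, -99, 3, -1000, 5])

# A list maps every listed value to the same replacement.
cleaned = s.replace([-99, -1000], np.nan)

# A dictionary maps each value to its own replacement.
recoded = s.replace({-99: np.nan, -1000: 0})

print(cleaned)
print(recoded)
```

Both results are new Series; `s` keeps its original values and integer dtype.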

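For reference, filling every missing value in the original six columns with zeros, as df.fillna(0) does, gives:

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
1          0         0   0.0   0           0.0            0.0
2       Tina       Ali  36.0   f           0.0            0.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0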

Happy learning....¶